NLP

A Survey of Zero-Shot Learning: Settings, Methods, and Applications

Zero-shot learning (ZSL) is one of the important frontier research directions in academia. Although researchers have annotated many standard datasets, even one as famous as ImageNet, with over ten million images, covers only 21,841 classes; labeled data still accounts for only a small fraction of the real world, and in many scenarios, such as disease images, large amounts of data are hard to obtain. It is therefore valuable to study how to learn effectively and make predictions when no labeled data is available in the target domain. This paper gives a clear definition of the ZSL problem, divides ZSL into three settings according to how data are used during model training, describes several ways existing work models the label semantic space, and finally presents a number of representative ZSL methods.

paper: https://drive.google.com/open?id=1mX1l3AhXz20gIajLjCRso6JMZWLRIxTB
source: ACM Trans. Intell. Syst. Technol.

Overview of Zero-Shot Learning

Definition of Zero-Shot Learning

Zero-shot learning (ZSL) is defined as follows: given labeled training data belonging to the seen classes and the seen label set, learn a model that predicts labels for unlabeled test instances belonging to the unseen classes, where the set of unseen labels is known in advance. The seen and unseen label sets have an empty intersection.
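
Stated a bit more formally (the notation below is a light paraphrase, not necessarily the survey's exact symbols): let $\mathcal{S}$ and $\mathcal{U}$ denote the seen and unseen label sets,

$$\mathcal{S}=\left\{c_{1}^{s}, \ldots, c_{N_{s}}^{s}\right\}, \quad \mathcal{U}=\left\{c_{1}^{u}, \ldots, c_{N_{u}}^{u}\right\}, \quad \mathcal{S} \cap \mathcal{U}=\emptyset .$$

Given labeled training data $D^{tr}=\left\{\left(\mathbf{x}_{i}, y_{i}\right)\right\}$ with $y_{i} \in \mathcal{S}$, the goal is to learn a classifier $f^{u}(\cdot)$ that assigns labels from $\mathcal{U}$ to the unlabeled test instances.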

Different Learning Settings

According to how data are used during model training, the paper divides ZSL research into three settings:

  1. CIII (Class-Inductive Instance-Inductive setting): only the labeled training instances and the seen label set are used to train the model;

  2. CTII (Class-Transductive Instance-Inductive setting): the labeled training instances and the seen label set, plus the unseen label set, are used to train the model;

  3. CTIT (Class-Transductive Instance-Transductive setting): the labeled training instances and the seen label set, plus the unseen label set and the corresponding unlabeled test instances, are used to train the model.

Semantic Spaces

As no labeled instances belonging to the unseen classes are available, to solve the zero-shot learning problem, some auxiliary information is necessary.

The paper divides label semantic spaces into the following two categories:

  • Engineered Semantic Spaces: feature spaces designed by human experts
    • Attribute spaces: class representations built from class attributes
    • Lexical spaces: built from lexical resources
    • Text-keyword spaces: built from keywords, e.g., TF-IDF features
  • Learned Semantic Spaces: class representations learned automatically
    • Label-embedding spaces: word embeddings of the class names (a sketch follows this list)
    • Text-embedding spaces: pretrained encodings of the class descriptions
    • Image-representation spaces: image representations extracted by pretrained models
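
As a small, concrete illustration of a learned semantic space, the sketch below builds label-embedding prototypes by averaging pretrained word vectors over the tokens of a class name; the `word_vectors` lookup is a hypothetical placeholder for any pretrained embedding table (e.g., word2vec or GloVe) and is not prescribed by the survey.

```python
import numpy as np

# Hypothetical pretrained word-embedding lookup: token -> vector of shape (dim,),
# e.g., loaded from word2vec or GloVe.
word_vectors: dict[str, np.ndarray] = {}

def class_prototype(class_name: str, dim: int = 300) -> np.ndarray:
    """Label-embedding prototype: average the word vectors of the class-name tokens."""
    tokens = class_name.lower().replace("_", " ").split()
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:                      # class name entirely out of vocabulary
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# One prototype per class; seen and unseen classes live in the same semantic space.
prototypes = {c: class_prototype(c) for c in ["zebra", "polar_bear", "killer_whale"]}
```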

Methods

The paper groups existing methods into two categories:

  • Classifier-Based Methods: directly learn classifiers for the unseen classes
  • Instance-Based Methods: first obtain labeled instances for the unseen classes, then build classifiers from them

Classifier-Based Methods

Existing classifier-based methods typically learn a set of one-versus-rest binary classifiers and fall into the following three groups:

  • Correspondence Methods: build classifiers for the unseen classes via the correspondence between class prototypes and the corresponding one-versus-rest binary classifiers.
  • Relationship Methods: build classifiers for the unseen classes based on the relationships among classes.
  • Combination Methods: build classifiers for the unseen classes by combining classifiers for the basic elements (e.g., attributes) that constitute the classes.

Correspondence Methods

In the semantic space, for each class, there is just one corresponding prototype. Thus, this prototype can be regarded as the “representation” of this class. Meanwhile, in the feature space, for each class, there is a corresponding binary one-versus-rest classifier, which can also be regarded as the “representation” of this class. Correspondence methods aim to learn a correspondence function between these two types of “representations.”

Concretely: given the seen-class training data, learn a correspondence function $\omega_{i}=\varphi\left(\mathbf{t}_{i} ; \zeta\right)$, where $\omega_{i}$ are the parameters of the binary classifier $f_{i}\left(\cdot ; \omega_{i}\right)$ and $\mathbf{t}_{i}$ is the prototype of the $i$-th class; then, for each unseen class, construct its binary classifier from the learned correspondence function.
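
A minimal sketch of this idea (my own simplification, assuming a linear ridge-regression correspondence function and scikit-learn one-versus-rest classifiers; not the formulation of any particular paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

def correspondence_zsl(X, y, seen_prototypes, unseen_prototypes):
    """X: (n, d) seen-class features; y: (n,) labels in 0..S-1;
    seen_prototypes: (S, m); unseen_prototypes: (U, m) class representations t_i."""
    # 1) One one-versus-rest binary classifier per seen class -> weights omega_i.
    seen_weights = []
    for c in range(seen_prototypes.shape[0]):
        clf = LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
        seen_weights.append(np.concatenate([clf.coef_.ravel(), clf.intercept_]))
    seen_weights = np.stack(seen_weights)                  # (S, d+1)

    # 2) Correspondence function phi: prototype t_i -> classifier weights omega_i.
    phi = Ridge(alpha=1.0).fit(seen_prototypes, seen_weights)

    # 3) Classifier weights for each unseen class, from its prototype alone.
    return phi.predict(unseen_prototypes)                  # (U, d+1)

def predict_unseen(X_test, unseen_weights):
    scores = X_test @ unseen_weights[:, :-1].T + unseen_weights[:, -1]
    return scores.argmax(axis=1)                           # index of the predicted unseen class
```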

Semantic-embedding based: DeViSE [2]
Frome, A., Corrado, G. S., Shlens, J. et al. (2013) proposed DeViSE, a metric-learning baseline for the ZSL problem.

The method matches an unseen image to a label by matching their embeddings, which yields the classification result. For images, an ordinary classification model is trained and the embedding of each test image is taken from it; for labels, which are text, a language model is trained to obtain the embedding of each label, and an image is assigned the label whose embedding is most similar. The loss is designed as a hinge rank loss:
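
In the notation of the DeViSE paper, with $\vec{v}(\text{image})$ the image embedding, $M$ the trainable projection matrix, $\vec{t}_{\text{label}}$ the word embedding of the true label, and $\vec{t}_{j}$ the embeddings of the other labels:

$$\operatorname{loss}(\text{image}, \text{label})=\sum_{j \neq \text{label}} \max \left[0,\ \text{margin}-\vec{t}_{\text{label}} M \vec{v}(\text{image})+\vec{t}_{j} M \vec{v}(\text{image})\right]$$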

Relationship Methods

Its insight is to construct a classifier for the unseen classes based on the relationships among classes.

Concretely: first train binary classifiers for the seen classes, then use the class prototypes to capture the relationships between seen and unseen classes, and finally combine the two to predict test instances.

Semantic inter-class relationships: SIR [3]
The core of SIR is to build semantic relationships between the seen and unseen classes and use them to predict instances of the unseen classes. The paper measures the semantic relatedness between classes with both WordNet and word embeddings, and then constructs the unseen-class classifier as a weighted sum of the seen-class classifiers:
$$\mathbf{w}_{u}=\sum_{k=1}^{K} s_{u k}\, \mathbf{w}_{k}$$
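
A minimal sketch of this weighted combination (assuming linear one-versus-rest classifiers for the seen classes, with weight vectors $\mathbf{w}_{k}$, and a precomputed similarity matrix $s_{uk}$, e.g., cosine similarity of label embeddings):

```python
import numpy as np

def unseen_classifiers(seen_weights, similarity):
    """seen_weights: (K, d) weight vectors of the K seen-class binary classifiers;
    similarity: (U, K) semantic similarity s_uk between unseen class u and seen class k.
    Returns (U, d): w_u = sum_k s_uk * w_k for each unseen class u."""
    return similarity @ seen_weights

def predict(X_test, unseen_weights):
    # Assign each test instance to the unseen class whose combined classifier scores highest.
    return (X_test @ unseen_weights.T).argmax(axis=1)
```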

Combination Methods

Its insight is to construct the classifier for unseen classes by the combination of classifiers for basic elements that are used to constitute the classes.

Each class is viewed as a combination of basic attributes.
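
The classic instance of this family is attribute-based prediction in the style of DAP (Lampert et al.), which the post does not spell out; the sketch below is a rough simplification under two assumptions: class-attribute signatures are binary, and every attribute takes both values among the seen classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_classifiers(X, y, seen_signatures):
    """X: (n, d) seen features; y: (n,) seen labels; seen_signatures: (S, A) binary class-attribute matrix.
    Trains one probabilistic classifier per basic attribute."""
    attr_labels = seen_signatures[y]        # (n, A): each instance inherits its class's attribute signature
    return [LogisticRegression(max_iter=1000).fit(X, attr_labels[:, a])
            for a in range(seen_signatures.shape[1])]

def predict_unseen_by_attributes(X_test, attr_clfs, unseen_signatures, eps=1e-6):
    """Combine per-attribute probabilities into an unseen-class score (naive-Bayes-style log sum)."""
    p = np.stack([clf.predict_proba(X_test)[:, 1] for clf in attr_clfs], axis=1)   # (n, A)
    log_p = np.log(np.clip(p, eps, 1 - eps))
    log_1mp = np.log(np.clip(1 - p, eps, 1 - eps))
    scores = log_p @ unseen_signatures.T + log_1mp @ (1 - unseen_signatures).T     # (n, U)
    return scores.argmax(axis=1)
```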

Building an attribute semantic space
Li, Y., Zhang, J., Zhang, J., & Huang, K. (2018) propose an improvement: manually construct a set of attributes describing the images to build an attribute semantic space, which not only provides a finer and more accurate semantic representation but also narrows the gap between the instance domain and the label domain.

The scheme contains two models, corresponding to the attribute set and the label set; the learned augmented matrix has two parts: user-defined attributes (UA) and latent discriminative attributes (LA). The two models respectively learn to match instances with the label encodings and with the latent semantics of instances and labels; the results are concatenated into one matrix, where the attribute sub-matrix is trained with a logistic loss and the latent-semantics sub-matrix with a hinge rank loss.

Instance-Based Methods

Instance-based methods aim to first obtain labeled instances for the unseen classes and then, with these instances, learn the zero-shot classifier $f^{u}(\cdot)$.

Existing methods fall into three categories:

  • projection methods
  • instance-borrowing methods
  • synthesizing methods

Projection Methods

Its insight is to obtain labeled instances for the unseen classes by projecting both the feature space instances and the semantic space prototypes into a common space.

Concretely: first project the seen-class instances and the class prototypes into a common space, then perform classification in that space.

Hubness and Pollution: Delving into Cross-Space Mapping for Zero-Shot Learning [4]
The paper learns a linear mapping from instance features to the class (semantic) representations, optimized with either a least-squares loss or a max-margin loss; at prediction time, unseen classes are assigned by a nearest-neighbor search.
$$\hat{\mathbf{W}}=\underset{\mathbf{W} \in \mathbb{R}^{d_{1} \times d_{2}}}{\operatorname{argmin}}\ \|\mathbf{X} \mathbf{W}-\mathbf{Y}\|+\lambda\|\mathbf{W}\|$$
$$\sum_{j \neq i}^{k} \max \left\{0, \gamma+\operatorname{dist}\left(\hat{\mathbf{y}}_{i}, \mathbf{y}_{i}\right)-\operatorname{dist}\left(\hat{\mathbf{y}}_{i}, \mathbf{y}_{j}\right)\right\}$$
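
A minimal sketch of the least-squares variant (the squared-norm version of the ridge problem above has the closed-form solution $\hat{\mathbf{W}}=(\mathbf{X}^{\top} \mathbf{X}+\lambda \mathbf{I})^{-1} \mathbf{X}^{\top} \mathbf{Y}$), with prediction by nearest-neighbor search against the unseen-class prototypes in the semantic space:

```python
import numpy as np

def fit_projection(X, Y, lam=1.0):
    """Ridge regression from the feature space to the semantic space.
    X: (n, d1) instance features; Y: (n, d2) semantic vectors of each instance's class."""
    d1 = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d1), X.T @ Y)    # W: (d1, d2)

def predict_nn(X_test, W, unseen_prototypes):
    """Project test instances into the semantic space and pick the nearest unseen prototype (cosine)."""
    Z = X_test @ W
    Z = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    P = unseen_prototypes / (np.linalg.norm(unseen_prototypes, axis=1, keepdims=True) + 1e-12)
    return (Z @ P.T).argmax(axis=1)                                 # index of the predicted unseen class
```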

Instance-Borrowing Methods

Its insight is to obtain labeled instances for the unseen classes by borrowing from the training instances.

Zero-shot learning with transferred samples [5]
The core procedure is as follows:

  • Training stage: build inter-class similarities from label embeddings, compute the transferability of each seen-class instance toward the unseen classes, assign pseudo labels of the target (unseen) classes accordingly, and retrain classifiers for the unseen classes (the paper uses SVMs); see the sketch after this list.
  • Testing stage: predict directly with the resulting classifiers.
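
A rough sketch of the training stage (my simplification, not the exact formulation of [5]): the transferability of a seen instance toward an unseen class is approximated by the label-embedding similarity between its own class and that unseen class; the most transferable instances are borrowed as pseudo-positive examples and an SVM is trained per unseen class.

```python
import numpy as np
from sklearn.svm import LinearSVC

def borrow_and_train(X, y, seen_emb, unseen_emb, top_k=200):
    """X: (n, d) seen features; y: (n,) seen labels in 0..S-1;
    seen_emb: (S, m) and unseen_emb: (U, m) label embeddings.
    Returns one binary SVM per unseen class, trained on borrowed (pseudo-labeled) instances."""
    # Inter-class similarity from normalized label embeddings (proxy for transferability).
    S = seen_emb / np.linalg.norm(seen_emb, axis=1, keepdims=True)
    U = unseen_emb / np.linalg.norm(unseen_emb, axis=1, keepdims=True)
    sim = U @ S.T                                         # (U, S)

    classifiers = []
    for u in range(U.shape[0]):
        score = sim[u, y]                                 # transferability of every seen instance toward class u
        order = np.argsort(-score)
        pos, neg = order[:top_k], order[-top_k:]          # borrow the most / least transferable instances
        X_u = np.vstack([X[pos], X[neg]])
        y_u = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        classifiers.append(LinearSVC().fit(X_u, y_u))
    return classifiers
```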

Synthesizing Methods

Its insight is to obtain labeled instances for the unseen classes by synthesizing some pseudo instances.

A generative adversarial approach for zero-shot learning from noisy texts [6]
The core of the paper is an adversarial approach that generates visual features from the textual descriptions of the unseen classes and thereby synthesizes pseudo labeled data. Compared with instance-borrowing methods, these methods fabricate the input features X rather than the pseudo labels Y.
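
A compact sketch of the synthesis step (assumptions: PyTorch, a single-hidden-layer conditional generator, and the adversarial training loop omitted; [6] uses a more elaborate architecture with additional losses for suppressing noisy text):

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Conditional generator: (class text embedding, noise) -> synthetic visual feature."""
    def __init__(self, text_dim, noise_dim, feat_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + noise_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim), nn.ReLU())   # CNN features are typically non-negative

    def forward(self, text_emb, noise):
        return self.net(torch.cat([text_emb, noise], dim=1))

def synthesize_unseen(generator, unseen_text_emb, n_per_class=300, noise_dim=100):
    """After adversarial training (omitted here), fabricate pseudo instances X for each unseen class."""
    feats, labels = [], []
    with torch.no_grad():
        for u, t in enumerate(unseen_text_emb):            # t: (text_dim,) description embedding
            text = t.unsqueeze(0).expand(n_per_class, -1)
            noise = torch.randn(n_per_class, noise_dim)
            feats.append(generator(text, noise))
            labels.append(torch.full((n_per_class,), u, dtype=torch.long))
    # Any ordinary supervised classifier can now be trained on (feats, labels).
    return torch.cat(feats), torch.cat(labels)
```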

Generative Adversarial Zero-Shot Relational Learning for Knowledge Graphs [7]
This AAAI 2020 paper follows a very similar idea, also using a GAN to obtain good input feature representations from textual descriptions.

Zero-shot visual recognition using semantics-preserving adversarial embedding networks [8]
This paper mainly addresses the information loss that can occur when encoding instances: when classifying seen-class instances, the model may rely on a few highly discriminative attributes and ignore other information, which often causes failures when predicting unseen classes at test time. The main idea is to introduce an autoencoder and adversarial training so that the classification features and the reconstruction features remain close to each other (i.e., interact).

Summary

Future Directions

  1. Characteristics of input data: make full use of the characteristics of the task at hand
  2. Selection of training data:
    • Heterogeneous training and testing data: current ZSL tasks assume training and test data share the same data type and semantic space, but in practice the training and test data may be heterogeneous.
    • Actively selecting training data: combine ZSL with active learning and curriculum learning
  3. Selection and maintenance of auxiliary information: the choice and use of auxiliary information is at the core of ZSL
  4. More realistic and application-specific problem settings: different ZSL settings give rise to different problems; in practice the label space is often open, and, for example, a GCN can be used to capture the structural relations among classes.
  5. Theoretical guarantee
  6. Combination with other learning paradigms: e.g., combination with few-shot learning

Reference

[1] Wei Wang, Vincent W. Zheng, Han Yu, and Chunyan Miao. 2019. A survey of zero-shot learning: Settings, methods, and applications. ACM Trans. Intell. Syst. Technol. 10, 2, Article 13 (January 2019), 37 pages.

[2] Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13). Curran Associates Inc.

[3] Chuang Gan, Ming Lin, Yi Yang, Yueting Zhuang, and Alexander G. Hauptmann. 2015. Exploring semantic interclass relationships (SIR) for zero-shot action recognition. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI’15). 3769–3775.

[4] Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL’15). 270–280.

[5] Yuchen Guo, Guiguang Ding, Jungong Han, and Yue Gao. 2017. Zero-shot learning with transferred samples. IEEE Transactions on Image Processing 26, 7 (2017), 3277–3290.

[6] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. 2018. A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).

[7] Pengda Qin, Xin Wang, Wenhu Chen, Chunyun Zhang, Weiran Xu, and William Yang Wang. 2020. Generative adversarial zero-shot relational learning for knowledge graphs. arXiv preprint arXiv:2001.02332.

[8] Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. 2018. Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).